The goal of this project is to predict whether a savings customer will take out a credit. We have several sources of data, including savings account transactions, ZIP codes, ATM geographical and transactional information, and open data on crime and sociodemographic areas in Mexico.
In this document we present an exploratory data analysis of the available information. There are around 12 million savings customers and 800 thousand credit and savings customers in Banco Azteca (BAZ), of which we have a sample of 1 million savings customers. The analysis in this document is based on this sample and the whole population of credit customers.
| quantile | abonos | abonos_monto | retiros | retiros_monto | num_meses | tiempo_meses | freq |
|---|---|---|---|---|---|---|---|
| 0% | 0 | 0.000 | 0 | -39804335.20 | 1 | 1 | 0.0312500 |
| 5% | 1 | 1.000 | 0 | -135549.10 | 1 | 6 | 0.0526316 |
| 10% | 1 | 50.000 | 0 | -80000.00 | 1 | 8 | 0.0714286 |
| 15% | 1 | 150.000 | 1 | -55267.97 | 2 | 11 | 0.0967742 |
| 20% | 1 | 554.000 | 1 | -40600.00 | 2 | 13 | 0.1250000 |
| 25% | 2 | 1212.975 | 2 | -30724.68 | 3 | 14 | 0.1428571 |
| 30% | 2 | 2075.000 | 2 | -23850.00 | 3 | 16 | 0.1666667 |
| 35% | 3 | 3200.000 | 3 | -18616.00 | 4 | 18 | 0.2000000 |
| 40% | 3 | 4636.000 | 4 | -14750.00 | 4 | 19 | 0.2222222 |
| 45% | 4 | 6200.000 | 5 | -11500.00 | 5 | 21 | 0.2500000 |
| 50% | 5 | 8350.000 | 7 | -9006.00 | 5 | 23 | 0.2812500 |
| 55% | 6 | 10800.000 | 8 | -6940.00 | 6 | 25 | 0.3125000 |
| 60% | 8 | 14000.000 | 10 | -5170.00 | 7 | 27 | 0.3333333 |
| 65% | 9 | 17850.000 | 12 | -3897.79 | 8 | 29 | 0.3600000 |
| 70% | 11 | 22917.640 | 15 | -2650.00 | 9 | 30 | 0.3750000 |
| 75% | 14 | 29950.000 | 19 | -1650.00 | 10 | 31 | 0.3846154 |
| 80% | 18 | 39550.000 | 25 | -900.00 | 10 | 32 | 0.4210526 |
| 85% | 24 | 54000.000 | 32 | -300.00 | 11 | 32 | 0.5000000 |
| 90% | 33 | 78500.000 | 45 | 0.00 | 12 | 32 | 0.5882353 |
| 95% | 53 | 132810.063 | 72 | 0.00 | 12 | 32 | 0.7500000 |
| 100% | 60199 | 42131739.960 | 22009 | 0.00 | 12 | 32 | 1.0000000 |
Number of months / Total months: a value between 0 and 1. A value of 1 means that the customer had activity in every month available in the data; a value of 0 would mean that no activity took place. A value of 0 is not possible in this database because every customer has at least one transaction.
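The ratio can be sketched as follows. This is a minimal illustration that assumes `num_meses` counts the months with at least one transaction and `tiempo_meses` counts the months available for the customer. The function name and the sample pairs are hypothetical and are not taken from the data.

```python
from statistics import quantiles


def activity_frequency(num_meses: int, tiempo_meses: int) -> float:
    """Share of the available months in which the customer had activity."""
    if tiempo_meses <= 0:
        raise ValueError("tiempo_meses must be positive")
    if not 0 <= num_meses <= tiempo_meses:
        raise ValueError("num_meses must lie between 0 and tiempo_meses")
    return num_meses / tiempo_meses


# Hypothetical (num_meses, tiempo_meses) pairs, for illustration only.
customers = [(1, 32), (3, 14), (5, 23), (10, 31), (12, 12)]
freqs = [activity_frequency(n, t) for n, t in customers]

# 19 cut points at 5%, 10%, ..., 95%, i.e. the interior of a 5%-step grid.
cut_points = quantiles(freqs, n=20, method="inclusive")
```

Here `activity_frequency(1, 32)` gives 0.03125 and `activity_frequency(12, 12)` gives 1.0, which match the smallest and largest values in the last column of the table.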
| credit_or_savings | electronic_banking | proportion |
|---|---|---|
| credit | 0 | 0.9509844 |
| credit | 1 | 0.0490156 |
| savings | 0 | 0.9553800 |
| savings | 1 | 0.0446190 |
| credit_or_savings | active_electronic_banking | proportion |
|---|---|---|
| credit | 0 | 0.9686378 |
| credit | 1 | 0.0313622 |
| savings | 0 | 0.9772680 |
| savings | 1 | 0.0227310 |
It can be seen that the number of people taking credits has been decreasing.
We have information about the customers’ ZIP codes. Combined with publicly available information from sources like INEGI, this could be used to estimate the socioeconomic level of each savings customer.
Available sources:
AGEB stands for Área GeoEstadística Básica (Basic Geostatistical Area), and a locality is a general term used by CONAPO to define several AGEBs.
This document uses the socioeconomic regions defined by INEGI and the marginalization index by locality defined by CONAPO.
ZIP code geographical information is available. According to the official postal code webpage, there are 32,448 different ZIP codes in Mexico, of which around 25,000 are available as shapefiles. The official ZIP code shapefiles are published on the government’s open data webpage, but not all of them are available yet, because the Mexican postal service is still working on delimiting each code. Other resources exist, for example a non-official collection of shapefiles of neighborhoods and ZIP codes. In addition, Google’s geocoding API is used as a last resort to find information about some ZIP codes.
Even with all this information, one problem remains: a number of ZIP codes aren’t officially assigned to any human settlement but are still used by people due to tradition or misinformation. So geographic information may not be available for all customers, but it will be for most of them.
The polygons defining the ZIP codes aren’t equivalent to the polygons defining the AGEBs and localities, so a mapping between them is needed to be able to use the publicly available information. Perhaps the simplest solution is to find the centroid of each ZIP code and of each AGEB or locality, and then map a given ZIP code to the closest AGEB or locality centroid.
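The nearest-centroid idea can be sketched as follows. This is a minimal illustration, not necessarily the procedure used to build the mapping in this document. It assumes that centroids are already available as (longitude, latitude) pairs in degrees and that great-circle (haversine) distance on a spherical Earth is an acceptable distance measure. The identifiers `ageb_a` and `ageb_b` are hypothetical placeholders; the coordinates are rounded values of the kind shown in the mapping table.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_centroid(point, centroids):
    """Return (key, distance_km) of the centroid closest to `point`.

    `point` is a (lon, lat) pair and `centroids` maps an identifier
    to a (lon, lat) pair. Brute force: every candidate is checked.
    """
    lon, lat = point
    return min(
        ((key, haversine_km(lon, lat, c_lon, c_lat))
         for key, (c_lon, c_lat) in centroids.items()),
        key=lambda pair: pair[1],
    )


# ZIP centroid for illustration; AGEB identifiers are hypothetical.
zip_centroids = {"56364": (-98.93143, 19.44496)}
ageb_centroids = {
    "ageb_a": (-98.93469, 19.44372),
    "ageb_b": (-98.94869, 19.43894),
}

mapping = {
    zip_code: nearest_centroid(point, ageb_centroids)
    for zip_code, point in zip_centroids.items()
}
```

The brute-force search compares every ZIP code with every AGEB, which is simple but grows with the product of both counts; a spatial index would be the usual alternative at larger scale.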
We have a classification for each AGEB that aims to show the differences among AGEBs based on indicators related to housing, education, health and employment, built from the last population census. Each AGEB is classified into one of 7 strata, such that stratum 7 contains the AGEBs with the most favorable average conditions and stratum 1 contains the AGEBs with the least favorable average conditions.
In the next images, maps of Mexico City and surroundings, Monterrey and Guadalajara are shown.
Map of Mexico City with centroids of each polygon:
Now, same map for Guadalajara, Jalisco:
And finally, for Monterrey, Nuevo León:
ZIP code information with their centroids can be seen in the next map of Mexico City:
ZIP code information with their centroids can be seen in the next map of Guadalajara. Some of the centroids may not match the plotted polygon perfectly because the database treats the ZIP code and the identifier as different groups.
ZIP code information with their centroids can be seen in the next map of Monterrey:
Finally, plotting the centroids of AGEBs and ZIP codes in Mexico City together, we get:
Guadalajara:
Monterrey:
So, for each available ZIP code, the closest AGEB centroid is found and a mapping is made to assign an AGEB to each ZIP code, such that we get a table in the following format:
| ZIP | ZIP long | ZIP lat | Nearest AGEB | AGEB long | AGEB lat | Distance (km) | Classification |
|---|---|---|---|---|---|---|---|
| 56364 | -98.93143 | 19.44496 | 1.503100e+12 | -98.93469 | 19.44372 | 0.3680725 | 3 |
| 56367 | -98.95076 | 19.44106 | 1.503100e+12 | -98.94869 | 19.43894 | 0.3201608 | 4 |
| 56365 | -98.94247 | 19.43852 | 1.503100e+12 | -98.94134 | 19.43799 | 0.1325068 | 4 |
| 96340 | -94.60759 | 18.00084 | 3.004801e+12 | -94.60721 | 18.00117 | 0.0547658 | 6 |
| 42850 | -99.33818 | 19.92243 | 1.306300e+12 | -99.33511 | 19.91824 | 0.5655460 | 6 |
| 57850 | -98.97560 | 19.38088 | 1.505800e+12 | -98.97690 | 19.38002 | 0.1661747 | 6 |
| 97300 | -89.70512 | 21.01598 | 3.110000e+12 | -89.74094 | 21.02427 | 3.8302693 | 2 |
| 61531 | -100.37365 | 19.42391 | 1.611200e+12 | -100.37496 | 19.42216 | 0.2384809 | 4 |
| 41706 | -98.41225 | 16.69447 | 1.204600e+12 | -98.40838 | 16.69271 | 0.4568835 | 4 |
| 53750 | -99.24115 | 19.45593 | 1.505700e+12 | -99.24014 | 19.45617 | 0.1088229 | 6 |
In the following graph, a histogram shows the distribution of the distance between the centroid of the ZIP code and the centroid of the AGEB. The red lines represent quantiles 0.5, 0.75, 0.9 and 0.95. As can be seen, most of the mass is concentrated in distances shorter than 10 km. This may seem small, but within a city the landscape can change dramatically in 10 km.
In the following graph, the distance histogram is plotted once more, but with a separate panel depending on whether the ZIP code is in a rural, urban, semiurban or unknown type of area. In the urban and semiurban areas, more than 95% of ZIP codes are within 2.5 km of the closest centroid. The rural areas are the ones with a longer tail, which seems reasonable because rural areas are usually larger and AGEB information is scarce in these areas.
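A summary such as "more than 95% of ZIP codes are within 2.5 km" can be computed with a simple grouped count. The sketch below assumes each mapped ZIP code is available as an (area type, distance in km) pair; the function name and the sample records are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict


def share_within(records, threshold_km):
    """Share of ZIP codes per area type whose distance is <= threshold_km.

    `records` is an iterable of (area_type, distance_km) pairs.
    """
    totals = defaultdict(int)
    within = defaultdict(int)
    for area_type, distance_km in records:
        totals[area_type] += 1
        if distance_km <= threshold_km:
            within[area_type] += 1
    return {area: within[area] / totals[area] for area in totals}


# Hypothetical records, for illustration only.
records = [
    ("urban", 0.3),
    ("urban", 3.0),
    ("rural", 3.8),
    ("rural", 12.0),
]
shares = share_within(records, threshold_km=2.5)
```

With these made-up records, half of the urban ZIP codes and none of the rural ones fall within the threshold.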
The following graph shows the distribution of the distance for the 4 main states in Mexico.
The next graph combines the data of the last two graphs: it shows the distance distribution depending on whether the area is rural, urban, semiurban or unknown and on whether the ZIP code is in any of the 4 biggest states in Mexico. Once more, the distance is smaller in urban and semiurban areas than in rural areas.
This approach may fail in rural areas. Also, as can be noted, ZIP code polygons generally cover a larger area than AGEBs, so the heterogeneity within each ZIP code is being ignored.
CONAPO (Consejo Nacional de Población, National Population Council) publishes a marginalization index by locality, defining marginalization as the set of social problems or disadvantages of a community or locality. The index aims to summarize characteristics of the environment in which people live, using information on:
The index is computed using Principal Component Analysis: it is the first component, which is best explained by the absence of a refrigerator, the percentage of people without primary education and the percentage of people who can’t read or write. The index is then classified into five categories of marginalization: very low, low, medium, high and very high.
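To illustrate what "the index is the first component" means, here is a minimal sketch of a first principal component computed by power iteration on the correlation matrix. It is not CONAPO's implementation. It assumes that the indicators are standardized before the analysis, that no indicator is constant, and that the indicators are positively correlated, so that the starting vector is not orthogonal to the leading component. The indicator values are hypothetical.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(col) / n for col in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        for col, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def first_principal_component(rows, iterations=500):
    """Return (loadings, scores) of the first principal component."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    # Correlation matrix of the standardized indicators.
    corr = [
        [sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration converges to the dominant eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each locality's index value is its projection on the loadings.
    scores = [sum(zi * vi for zi, vi in zip(row, v)) for row in z]
    return v, scores


# Hypothetical localities: % without refrigerator, % without primary
# education, % who can't read or write.
indicators = [
    [5.0, 10.0, 2.0],
    [20.0, 25.0, 8.0],
    [40.0, 45.0, 20.0],
    [60.0, 55.0, 30.0],
    [80.0, 70.0, 45.0],
]
loadings, scores = first_principal_component(indicators)
```

With positive loadings, a higher score corresponds to higher values of the deprivation indicators. The sketch stops at the scores; the cut points for the five categories are not specified in this document, so they are not reproduced here. In practice a linear algebra library would replace the hand-written power iteration.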
In the next images, maps of Mexico City and surroundings, Monterrey and Guadalajara are shown.
Map of Mexico City with centroids of each polygon:
Now, same map for Guadalajara, Jalisco:
And finally, for Monterrey, Nuevo León:
It can be seen in the previous maps that the localities in Mexico’s main cities are much bigger than the AGEBs and the ZIP codes. The information provided by CONAPO may therefore not give a good estimate of the level of each ZIP code, because a single locality-level index treats a large area as homogeneous.
First, let’s look at the distribution of the classification of AGEBs in the country. Recall that 7 means the AGEB is “good” on average and 1 means it is “bad”.
And now, the mapping of the ZIP codes:
The distribution changed considerably. As the following graph shows, the original AGEBs were both urban (U) and rural (R), but the mapping consists only of urban ZIP codes, which may be one reason why the distribution changed so much.
And now let’s analyze the sample with 1 million savings customers and circa 800 thousand credit customers.
Out of the 1,859,441 customers, we have a mapped ZIP code for 1,590,674 of them (about 85.5%), distributed in the following way:
And now, conditioning on whether it’s a credit or savings customer:
Using information from crime reports, we create four indexes that together give a picture of crime in the region. The indexes are:
Crime dimension: this index gives a summarized idea of total crime in the region.
Non violent crime dimension: this index tells us about the number of non-violent crimes in the region.
Violent crime dimension: this index tells us about the number of violent crimes in the region.
Kidnap dimension: this index tells us about the number of kidnappings in the region.